Evaluation of the phi-3-mini SLM for identification of texts related to medicine, health, and sports injuries

Brogly, Chris, Rjaibi, Saif, Liang, Charlotte, Lam, Erica, Wang, Edward, Levitan, Adam, Paleczny, Sarah, Cusimano, Michael

arXiv.org Artificial Intelligence

Small Language Models (SLMs) have the potential to be used for automatically labelling and identifying aspects of text data for medicine/health-related purposes from documents and the web. As their resource requirements are significantly lower than those of Large Language Models (LLMs), they can potentially be deployed on more types of devices. SLMs are often benchmarked on health/medicine-related tasks such as MedQA, although performance can vary, especially depending on the size of the model in terms of number of parameters. Furthermore, these test results may not necessarily reflect real-world performance on the automatic labelling or identification of texts in documents and the web. As a result, we compared topic-relatedness scores from Microsoft's phi-3-mini-4k-instruct SLM to the topic-relatedness scores from 7 human evaluators on 1144 samples of medical/health-related texts and 1117 samples of sports injury-related texts. These texts were drawn from a larger dataset of about 9 million news headlines, each of which was processed and assigned scores by phi-3-mini-4k-instruct. Our sample was selected (filtered) based on 1 (low filtering) or more (high filtering) Boolean conditions on the phi-3 SLM scores. We found low-to-moderate significant correlations between the scores from the SLM and the human evaluators for sports injury texts with low filtering (ρ = 0.3413, p < 0.001) and medicine/health texts with high filtering (ρ = 0.3854, p < 0.001), and a low significant correlation for medicine/health texts with low filtering (ρ = 0.2255, p < 0.001). There was a negligible, non-significant correlation for sports injury-related texts with high filtering (ρ = 0.0318, p = 0.4466).
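The comparison above rests on rank correlation between model and human scores. A minimal sketch of that statistic (not the paper's code; the score lists below are invented for illustration, standing in for phi-3 scores and averaged ratings from the 7 evaluators):

```python
# Spearman's rho by hand: rank both score lists, then apply the
# classic no-ties formula rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)).
def ranks(xs):
    """Rank values from 1..n (assumes no ties, for simplicity)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(a, b):
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

slm_scores   = [0.9, 0.2, 0.7, 0.1, 0.8, 0.4]  # invented SLM relatedness scores
human_scores = [0.7, 0.3, 0.6, 0.2, 0.9, 0.4]  # invented averaged human ratings

print(round(spearman_rho(slm_scores, human_scores), 4))  # → 0.9429
```

In practice a library routine such as `scipy.stats.spearmanr` would also handle ties and return the p-value reported in the abstract.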


Classification of worldwide news articles by perceived quality, 2018-2024

McElroy, Connor, de Oliveira, Thiago E. A., Brogly, Chris

arXiv.org Artificial Intelligence

This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. Three machine learning classifiers and three deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024. Expert consensus ratings on 579 source websites were split at the median, creating perceived low- and high-quality classes of about 706,000 articles each, with 194 linguistic features per website-level labelled article. Traditional machine learning classifiers such as the Random Forest demonstrated capable performance (0.7355 accuracy, 0.8131 ROC AUC). For deep learning, ModernBERT-large (256 context length) achieved the best performance (0.8744 accuracy, 0.9593 ROC AUC, 0.8739 F1), followed by DistilBERT-base (512 context length) at 0.8685 accuracy and 0.9554 ROC AUC. DistilBERT-base (256 context length) reached 0.8478 accuracy and 0.9407 ROC AUC, while ModernBERT-base (256 context length) attained 0.8569 accuracy and 0.9470 ROC AUC. These results suggest that the perceived quality of worldwide news articles can be effectively differentiated by both traditional CPU-based machine learning classifiers and deep learning classifiers.
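ROC AUC, reported throughout the results above, can be computed directly from classifier scores via its rank-statistic (Mann-Whitney) interpretation: the probability that a randomly chosen positive article outscores a randomly chosen negative one. A small illustration with invented labels and scores:

```python
# ROC AUC as a pairwise rank statistic: count, over all positive/negative
# pairs, how often the positive scores higher (ties count as half a win).
def roc_auc(labels, scores):
    pairs = wins = 0
    for li, si in zip(labels, scores):
        if li != 1:
            continue
        for lj, sj in zip(labels, scores):
            if lj == 0:
                pairs += 1
                wins += 1.0 if si > sj else 0.5 if si == sj else 0.0
    return wins / pairs

labels = [1, 0, 1, 0, 1, 0]                 # invented class labels
scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]    # invented classifier scores
print(round(roc_auc(labels, scores), 4))   # → 0.8889
```

The O(n²) loop is fine for illustration; library implementations such as scikit-learn's `roc_auc_score` use a sort-based method for large datasets.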


Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

Wang, Ziliang, An, Kang, Zheng, Xuhui, Qian, Faqiang, Zhang, Weikun, Ouyang, Cijun, Cai, Jialu, Wang, Yuhang, Wu, Yichao

arXiv.org Artificial Intelligence

While search-augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi-hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where tasks are incorrectly broken down; retrieval missing, where key evidence fails to be retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure in any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process. ERL explicitly identifies faulty steps, erases them, and regenerates reasoning in place, preventing defective logic from propagating through the reasoning chain. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with the 3B model achieving +8.48% EM and +11.56% F1, and the 7B model achieving +5.38% EM and +7.22% F1 over previous state-of-the-art (SOTA) results. These findings suggest that erasable reinforcement learning provides a powerful paradigm shift for robust multi-step reasoning in LLMs.
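The erase-and-regenerate control flow described above can be sketched as a toy loop. Here `generate_step` and `verify_step` are hypothetical stand-ins for the policy model and the error detector; this is an illustration of the idea, not the authors' implementation:

```python
# Toy sketch of erasable reasoning: propose a step, verify it, and either
# keep it in the chain or discard ("erase") it and regenerate in place.
def erasable_reasoning(generate_step, verify_step, max_attempts=10):
    steps, attempts = [], 0
    while attempts < max_attempts:
        attempts += 1
        step = generate_step(steps)
        if not verify_step(steps, step):
            continue          # erase the faulty step; regenerate in place
        steps.append(step)    # verified steps stay in the reasoning chain
        if step == "ANSWER":
            break
    return steps

# Stub demo: the generator proposes a bad step first; it is erased.
proposals = iter(["bad step", "decompose", "retrieve", "ANSWER"])
result = erasable_reasoning(lambda s: next(proposals),
                            lambda s, step: step != "bad step")
print(result)  # → ['decompose', 'retrieve', 'ANSWER']
```

In the paper this behaviour is learned with reinforcement learning rather than hard-coded; the loop only shows why erasing a faulty step stops it from propagating.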


Do small language models generate realistic variable-quality fake news headlines?

McCutcheon, Austin, Brogly, Chris

arXiv.org Artificial Intelligence

Small language models (SLMs) have the capability for text generation and may potentially be used to generate falsified texts online. This study evaluates 14 SLMs (1.7B-14B parameters) from the LLaMA, Gemma, Phi, SmolLM, Mistral, and Granite families on generating perceived low- and high-quality fake news headlines when explicitly prompted, and on whether these appear similar to real-world news headlines. Using controlled prompt engineering, 24,000 headlines were generated across low-quality and high-quality deceptive categories. Existing machine learning and deep learning-based news headline quality detectors were then applied to these SLM-generated fake news headlines. The SLMs demonstrated high compliance rates with minimal ethical resistance, with occasional exceptions. Headline quality detection using established DistilBERT and bagging classifier models showed that quality misclassification was common, with detection accuracies ranging only from 35.2% to 63.5%. These findings suggest that the tested SLMs are generally compliant in generating falsified headlines, with slight variations in ethical restraint, and that the generated headlines did not closely resemble existing, primarily human-written content on the web, given the low quality-classification accuracy.


Binary classification for perceived quality of headlines and links on worldwide news websites, 2018-2024

McCutcheon, Austin, de Oliveira, Thiago E. A., Zheleznov, Aleksandr, Brogly, Chris

arXiv.org Artificial Intelligence

The proliferation of online news enables the potential widespread publication of perceived low-quality news headlines/links. As a result, we investigated whether it is possible to automatically distinguish perceived lower-quality news headlines/links from perceived higher-quality ones. We evaluated twelve machine learning models on a binary, balanced dataset of 57,544,214 worldwide news website links/headlines from 2018-2024 (28,772,107 per class) with 115 extracted linguistic features. Binary labels for each text were derived from scores based on expert consensus regarding the quality of the respective news domain. Traditional ensemble methods, particularly the bagging classifier, had strong performance (88.1% accuracy, 88.3% F1, 80/20 train/test split). A fine-tuned DistilBERT achieved the highest accuracy (90.3%, 80/20 train/test split) but required more training time. The results suggest that both NLP features with traditional classifiers and deep learning models can effectively differentiate perceived news headline/link quality, with some trade-off between predictive performance and training time.
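The 115 linguistic features are not enumerated in the abstract. As an illustration only, here are a few simple headline features of the kind such pipelines typically extract; these four are invented examples, not the paper's feature set:

```python
# Hypothetical headline feature extractor: surface-level linguistic
# signals (length, word statistics, casing, punctuation) of the kind
# a traditional classifier could consume as a fixed-length vector.
import string

def headline_features(text):
    words = text.split()
    n_chars = len(text)
    return {
        "word_count": len(words),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "upper_ratio": sum(c.isupper() for c in text) / max(n_chars, 1),
        "punct_count": sum(c in string.punctuation for c in text),
    }

feats = headline_features("SHOCKING: You Won't Believe This Trick!")
print(feats["word_count"], feats["punct_count"])  # → 6 3
```

A feature dictionary like this would be vectorized per headline and fed to an ensemble such as the bagging classifier mentioned above.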


Did ChatGPT or Copilot use alter the style of internet news headlines? A time series regression analysis

Brogly, Chris, McElroy, Connor

arXiv.org Artificial Intelligence

The release of advanced Large Language Models (LLMs) such as ChatGPT and Copilot is changing the way text is created and may influence the content we find on the web. This study investigated whether the release of these two popular LLMs coincided with a change in the writing style of headlines and links on worldwide news websites. 175 NLP features were obtained for each text in a dataset of 451 million headlines/links. An interrupted time series analysis was applied to each of the 175 NLP features to evaluate whether there were any statistically significant sustained changes after the release dates of ChatGPT and/or Copilot. A total of 44 features did not appear to show any significant sustained change after the release of ChatGPT/Copilot. A total of 91 other features did show significant change with ChatGPT and/or Copilot, although significance with earlier control LLM release dates (GPT-1/2/3, Gopher) removed them from consideration. This initial analysis suggests these language models may have had only a limited impact on the style of individual news headlines/links, with respect to some NLP measures.
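An interrupted time series analysis of a single NLP feature can be sketched as a segmented regression, y = b0 + b1·t + b2·post + b3·(t - t0)·post, fitted by ordinary least squares, where `post` indicates observations after the intervention at time t0. The series and t0 below are synthetic; the study fitted one such model per feature against the LLM release dates:

```python
# Segmented (interrupted time series) regression via the normal equations,
# solved with plain Gaussian elimination. b2 is the level change at t0,
# b3 the slope change.
def solve(A, b):
    """Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def its_fit(ys, t0):
    rows = [[1.0, t, float(t >= t0), max(t - t0, 0)] for t in range(len(ys))]
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(4)]
    return solve(XtX, Xty)

# Synthetic series: slope 0.5, level jump of +3 at t0 = 10, no noise.
ys = [2 + 0.5 * t + (3 if t >= 10 else 0) for t in range(20)]
b0, b1, b2, b3 = its_fit(ys, 10)
print(round(b2, 4))  # estimated level change → 3.0
```

A real analysis would use a statistics library (e.g. statsmodels OLS) to obtain standard errors and the p-values needed to judge whether a sustained change is significant.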


Investigation of the Impact of Economic and Social Factors on Energy Demand through Natural Language Processing

Bai, Yun, Camal, Simon, Michiorri, Andrea

arXiv.org Artificial Intelligence

These authors contributed equally to this work.

The relationship between energy demand and variables such as economic activity and weather is well established. However, this paper aims to explore the connection between energy demand and other social aspects, which have received little attention. Through the use of natural language processing on a large news corpus, we shed light on this important link. This study was carried out in five regions of the UK and Ireland and considers multiple horizons from 1 to 30 days. It also considers economic variables such as GDP, unemployment and inflation. We found that: 1) news about military conflicts, transportation, the global pandemic, regional economics, and the international energy market is related to electricity demand. Electricity demand modelling is a fundamental process in power system planning, operation, and energy trading [1]. In order to avoid additional carbon emissions from excess electricity generation and the high costs of electricity storage, electricity demand and supply should be matched over time [2]. Demand forecasting has become a means of enabling power dispatch, planning generation schedules, and integrating renewable energy sources [3]. Electricity demand forecasting is linked to various factors, including weather, economic activity, and major events.


Leveraging Compliant Tactile Perception for Haptic Blind Surface Reconstruction

Cheret, Laurent Yves Emile Ramos, da Fonseca, Vinicius Prado, de Oliveira, Thiago Eustaquio Alves

arXiv.org Artificial Intelligence

Non-flat surfaces pose difficulties for robots operating in unstructured environments. Reconstructions of uneven surfaces may only be partially possible due to non-compliant end-effectors and limitations on vision systems such as transparency, reflections, and occlusions. This study achieves blind surface reconstruction by harnessing the robotic manipulator's kinematic data and a compliant tactile sensing module, which incorporates inertial, magnetic, and pressure sensors. The module's flexibility enables us to estimate contact positions and surface normals by analyzing its deformation during interactions with unknown objects. While previous works collect only positional information, we include the local normals in a geometrical approach to estimate curvatures between adjacent contact points. These parameters then guide a spline-based patch generation, which allows us to recreate larger surfaces without an increase in complexity while reducing the time-consuming step of probing the surface. Experimental validation demonstrates that this approach outperforms an off-the-shelf vision system in estimation accuracy. Moreover, this compliant haptic method works effectively even when the manipulator's approach angle is not aligned with the surface normals, which is ideal for unknown non-flat surfaces.
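One ingredient of the approach above, estimating curvature between adjacent contact points from their surface normals, can be illustrated with a standard discrete approximation: the angle between the two normals divided by the chord length between the contacts. This is a generic sketch, not the authors' exact formulation:

```python
# Discrete curvature estimate between two contact points with known
# (unit-length) surface normals: angle between normals / chord length.
import math

def curvature_between(p1, n1, p2, n2):
    chord = math.dist(p1, p2)
    dot = sum(a * b for a, b in zip(n1, n2))
    angle = math.acos(max(-1.0, min(1.0, dot)))  # clamp for float safety
    return angle / chord

# Two nearby contacts on a unit circle: true curvature is 1.
p1, n1 = (1.0, 0.0), (1.0, 0.0)
p2 = (math.cos(0.1), math.sin(0.1))
print(round(curvature_between(p1, n1, p2, p2), 3))  # → 1.0
```

Such local curvature estimates between adjacent contacts are the kind of parameter that could then guide spline-based patch generation over the probed surface.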


Driverless cars: Researcher disguises himself as car seat in study

BBC News

A study to test people's reactions to driverless cars has used a "ghost driver" to record their responses. The work, by the University of Nottingham, found that, in the absence of someone in the driving seat, pedestrians trust certain visual prompts more than others when deciding whether to cross the road. As part of the study, a car was driven around the university's campus over several days with its driver - research fellow David R. Large - concealed in the driver's seat. Mr Large, senior research fellow with the Human Factors Research Group at the university, said: "We wanted to explore how pedestrians would interact with a driverless car and developed this unique methodology to explore their reactions."


Chatsworth's hidden 17th Century garden revealed in drone footage

BBC News

A hidden 17th Century garden that emerged during a heatwave has been shown in new drone footage. The European-style formal garden at the Chatsworth Estate in Derbyshire was designed in 1699 for the 1st Duke of Devonshire. It was grassed over 30 years later but substantial remains lie buried under just a thin layer of soil and grass, which has since been parched by the recent dry weather. While the historic design will not be fully restored any time soon, Steve Porter - head of gardens and landscape at Chatsworth - said he hoped the old garden, known as the Great Parterre, could be recreated with gravel once the grass had recovered. "Every time you look you almost see more of the detail, more of the scrolls of the beds and more of the paths and it sort of brings it all back to life and you realise just how intricate and just how amazing it would have been," he added.